{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三课 语言模型\n",
    "\n",
    "褚则伟 zeweichu@gmail.com\n",
    "\n",
    "学习目标\n",
    "- 学习语言模型，以及如何训练一个语言模型\n",
    "- 学习torchtext的基本使用方法\n",
    "    - 构建 vocabulary\n",
    "    - word to inde 和 index to word\n",
    "- 学习torch.nn的一些基本模型\n",
    "    - Linear\n",
    "    - RNN\n",
    "    - LSTM\n",
    "    - GRU\n",
    "- RNN的训练技巧\n",
    "    - Gradient Clipping\n",
    "- 如何保存和读取模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们会使用 [torchtext](https://github.com/pytorch/text) 来创建vocabulary, 然后把数据读成batch的格式。请大家自行阅读README来学习torchtext。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import torchtext\n",
    "from torchtext.vocab import Vectors\n",
    "import torch\n",
    "import numpy as np\n",
    "import random\n",
    "\n",
    "USE_CUDA = torch.cuda.is_available()\n",
    "\n",
    "# 为了保证实验结果可以复现，我们经常会把各种random seed固定在某一个值\n",
    "random.seed(53113)\n",
    "np.random.seed(53113)\n",
    "torch.manual_seed(53113)\n",
    "if USE_CUDA:\n",
    "    torch.cuda.manual_seed(53113)\n",
    "\n",
    "BATCH_SIZE = 32\n",
    "EMBEDDING_SIZE = 650\n",
    "MAX_VOCAB_SIZE = 50000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 我们会继续使用上次的text8作为我们的训练，验证和测试数据\n",
    "- TorchText的一个重要概念是`Field`，它决定了你的数据会如何被处理。我们使用`TEXT`这个field来处理文本数据。我们的`TEXT` field有`lower=True`这个参数，所以所有的单词都会被lowercase。\n",
    "- torchtext提供了LanguageModelingDataset这个class来帮助我们处理语言模型数据集。\n",
    "- `build_vocab`可以根据我们提供的训练数据集来创建最高频单词的单词表，`max_size`帮助我们限定单词总量。\n",
    "- BPTTIterator可以连续地得到连贯的句子，BPTT的全程是back propagation through time。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n",
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n",
      "The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vocabulary size: 50002\n"
     ]
    }
   ],
   "source": [
    "TEXT = torchtext.data.Field(lower=True)\n",
    "train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path=\".\", \n",
    "    train=\"text8.train.txt\", validation=\"text8.dev.txt\", test=\"text8.test.txt\", text_field=TEXT)\n",
    "TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)\n",
    "print(\"vocabulary size: {}\".format(len(TEXT.vocab)))\n",
    "\n",
    "VOCAB_SIZE = len(TEXT.vocab)\n",
    "train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits(\n",
    "    (train, val, test), batch_size=BATCH_SIZE, device=-1, bptt_len=32, repeat=False, shuffle=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 为什么我们的单词表有50002个单词而不是50000呢？因为TorchText给我们增加了两个特殊的token，`<unk>`表示未知的单词，`<pad>`表示padding。\n",
    "- 模型的输入是一串文字，模型的输出也是一串文字，他们之间相差一个位置，因为语言模型的目标是根据之前的单词预测下一个单词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "had dropped to just three zero zero zero k it was then cool enough to allow the nuclei to capture electrons this process is called recombination during which the first neutral atoms\n",
      "dropped to just three zero zero zero k it was then cool enough to allow the nuclei to capture electrons this process is called recombination during which the first neutral atoms took\n"
     ]
    }
   ],
   "source": [
    "it = iter(train_iter)\n",
    "batch = next(it)\n",
    "print(\" \".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))\n",
    "print(\" \".join([TEXT.vocab.itos[i] for i in batch.target[:,1].data]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "under the aegis of the emperor thus reducing the local identity and autonomy of the different regions of japan as japanese citizens the ainu are now governed by japanese laws though one\n",
      "the aegis of the emperor thus reducing the local identity and autonomy of the different regions of japan as japanese citizens the ainu are now governed by japanese laws though one ainu\n",
      "ainu man was acquitted of murder because he asserted that he was not a japanese citizen and the judge agreed and judged by japanese tribunals but in the past their affairs were\n",
      "man was acquitted of murder because he asserted that he was not a japanese citizen and the judge agreed and judged by japanese tribunals but in the past their affairs were administered\n",
      "administered by hereditary chiefs three in each village and for administrative purposes the country was divided into three districts <unk> <unk> and <unk> which were under the ultimate control of <unk> though\n",
      "by hereditary chiefs three in each village and for administrative purposes the country was divided into three districts <unk> <unk> and <unk> which were under the ultimate control of <unk> though the\n",
      "the relations between their respective inhabitants were not close and intermarriages were avoided the functions of judge were not entrusted to these chiefs an indefinite number of a community s members sat\n",
      "relations between their respective inhabitants were not close and intermarriages were avoided the functions of judge were not entrusted to these chiefs an indefinite number of a community s members sat in\n",
      "in judgement upon its criminals capital punishment did not exist nor was imprisonment resorted to beating being considered a sufficient and final penalty except in the case of murder when the nose\n",
      "judgement upon its criminals capital punishment did not exist nor was imprisonment resorted to beating being considered a sufficient and final penalty except in the case of murder when the nose and\n"
     ]
    }
   ],
   "source": [
    "for i in range(5):\n",
    "    batch = next(it)\n",
    "    print(\" \".join([TEXT.vocab.itos[i] for i in batch.text[:,2].data]))\n",
    "    print(\" \".join([TEXT.vocab.itos[i] for i in batch.target[:,2].data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义模型\n",
    "\n",
    "- 继承nn.Module\n",
    "- 初始化函数\n",
    "- forward函数\n",
    "- 其余可以根据模型需要定义相关的函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "\n",
    "class RNNModel(nn.Module):\n",
    "    \"\"\" 一个简单的循环神经网络\"\"\"\n",
    "\n",
    "    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):\n",
    "        ''' 该模型包含以下几层:\n",
    "            - 词嵌入层\n",
    "            - 一个循环神经网络层(RNN, LSTM, GRU)\n",
    "            - 一个线性层，从hidden state到输出单词表\n",
    "            - 一个dropout层，用来做regularization\n",
    "        '''\n",
    "        super(RNNModel, self).__init__()\n",
    "        self.drop = nn.Dropout(dropout)\n",
    "        self.encoder = nn.Embedding(ntoken, ninp)\n",
    "        if rnn_type in ['LSTM', 'GRU']:\n",
    "            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)\n",
    "        else:\n",
    "            try:\n",
    "                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]\n",
    "            except KeyError:\n",
    "                raise ValueError( \"\"\"An invalid option for `--model` was supplied,\n",
    "                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']\"\"\")\n",
    "            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)\n",
    "        self.decoder = nn.Linear(nhid, ntoken)\n",
    "\n",
    "        self.init_weights()\n",
    "\n",
    "        self.rnn_type = rnn_type\n",
    "        self.nhid = nhid\n",
    "        self.nlayers = nlayers\n",
    "\n",
    "    def init_weights(self):\n",
    "        initrange = 0.1\n",
    "        self.encoder.weight.data.uniform_(-initrange, initrange)\n",
    "        self.decoder.bias.data.zero_()\n",
    "        self.decoder.weight.data.uniform_(-initrange, initrange)\n",
    "\n",
    "    def forward(self, input, hidden):\n",
    "        ''' Forward pass:\n",
    "            - word embedding\n",
    "            - 输入循环神经网络\n",
    "            - 一个线性层从hidden state转化为输出单词表\n",
    "        '''\n",
    "        emb = self.drop(self.encoder(input))\n",
    "        output, hidden = self.rnn(emb, hidden)\n",
    "        output = self.drop(output)\n",
    "        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))\n",
    "        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden\n",
    "\n",
    "    def init_hidden(self, bsz, requires_grad=True):\n",
    "        weight = next(self.parameters())\n",
    "        if self.rnn_type == 'LSTM':\n",
    "            return (weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad),\n",
    "                    weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad))\n",
    "        else:\n",
    "            return weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "初始化一个模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = RNNModel(\"LSTM\", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)\n",
    "if USE_CUDA:\n",
    "    model = model.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 我们首先定义评估模型的代码。\n",
    "- 模型的评估和模型的训练逻辑基本相同，唯一的区别是我们只需要forward pass，不需要backward pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def evaluate(model, data):\n",
    "    model.eval()\n",
    "    total_loss = 0.\n",
    "    it = iter(data)\n",
    "    total_count = 0.\n",
    "    with torch.no_grad():\n",
    "        hidden = model.init_hidden(BATCH_SIZE, requires_grad=False)\n",
    "        for i, batch in enumerate(it):\n",
    "            data, target = batch.text, batch.target\n",
    "            if USE_CUDA:\n",
    "                data, target = data.cuda(), target.cuda()\n",
    "            hidden = repackage_hidden(hidden)\n",
    "            with torch.no_grad():\n",
    "                output, hidden = model(data, hidden)\n",
    "            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))\n",
    "            total_count += np.multiply(*data.size())\n",
    "            total_loss += loss.item()*np.multiply(*data.size())\n",
    "            \n",
    "    loss = total_loss / total_count\n",
    "    model.train()\n",
    "    return loss\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们需要定义下面的一个function，帮助我们把一个hidden state和计算图之前的历史分离。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Remove this part\n",
    "def repackage_hidden(h):\n",
    "    \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n",
    "    if isinstance(h, torch.Tensor):\n",
    "        return h.detach()\n",
    "    else:\n",
    "        return tuple(repackage_hidden(v) for v in h)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义loss function和optimizer\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "loss_fn = nn.CrossEntropyLoss()\n",
    "learning_rate = 0.001\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)\n",
    "scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练模型：\n",
    "- 模型一般需要训练若干个epoch\n",
    "- 每个epoch我们都把所有的数据分成若干个batch\n",
    "- 把每个batch的输入和输出都包装成cuda tensor\n",
    "- forward pass，通过输入的句子预测每个单词的下一个单词\n",
    "- 用模型的预测和正确的下一个单词计算cross entropy loss\n",
    "- 清空模型当前gradient\n",
    "- backward pass\n",
    "- gradient clipping，防止梯度爆炸\n",
    "- 更新模型参数\n",
    "- 每隔一定的iteration输出模型在当前iteration的loss，以及在验证集上做模型的评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "epoch 0 iter 0 loss 10.821578979492188\n",
      "best model, val loss:  10.782116411285918\n",
      "epoch 0 iter 1000 loss 6.5122528076171875\n",
      "epoch 0 iter 2000 loss 6.3599748611450195\n",
      "epoch 0 iter 3000 loss 6.13856315612793\n",
      "epoch 0 iter 4000 loss 5.473214626312256\n",
      "epoch 0 iter 5000 loss 5.901871204376221\n",
      "epoch 0 iter 6000 loss 5.85321569442749\n",
      "epoch 0 iter 7000 loss 5.636535167694092\n",
      "epoch 0 iter 8000 loss 5.7489800453186035\n",
      "epoch 0 iter 9000 loss 5.464158058166504\n",
      "epoch 0 iter 10000 loss 5.554863452911377\n",
      "best model, val loss:  5.264891533569864\n",
      "epoch 0 iter 11000 loss 5.703625202178955\n",
      "epoch 0 iter 12000 loss 5.6448974609375\n",
      "epoch 0 iter 13000 loss 5.372857570648193\n",
      "epoch 0 iter 14000 loss 5.2639479637146\n",
      "epoch 1 iter 0 loss 5.696778297424316\n",
      "best model, val loss:  5.124550380139679\n",
      "epoch 1 iter 1000 loss 5.534722805023193\n",
      "epoch 1 iter 2000 loss 5.599489212036133\n",
      "epoch 1 iter 3000 loss 5.459986686706543\n",
      "epoch 1 iter 4000 loss 4.927192211151123\n",
      "epoch 1 iter 5000 loss 5.435710906982422\n",
      "epoch 1 iter 6000 loss 5.4059576988220215\n",
      "epoch 1 iter 7000 loss 5.308575630187988\n",
      "epoch 1 iter 8000 loss 5.405811786651611\n",
      "epoch 1 iter 9000 loss 5.1389055252075195\n",
      "epoch 1 iter 10000 loss 5.226413726806641\n",
      "best model, val loss:  4.946829228873176\n",
      "epoch 1 iter 11000 loss 5.379891395568848\n",
      "epoch 1 iter 12000 loss 5.360724925994873\n",
      "epoch 1 iter 13000 loss 5.176026344299316\n",
      "epoch 1 iter 14000 loss 5.110936641693115\n"
     ]
    }
   ],
   "source": [
    "import copy\n",
    "GRAD_CLIP = 1.\n",
    "NUM_EPOCHS = 2\n",
    "\n",
    "val_losses = []\n",
    "for epoch in range(NUM_EPOCHS):\n",
    "    model.train()\n",
    "    it = iter(train_iter)\n",
    "    hidden = model.init_hidden(BATCH_SIZE)\n",
    "    for i, batch in enumerate(it):\n",
    "        data, target = batch.text, batch.target\n",
    "        if USE_CUDA:\n",
    "            data, target = data.cuda(), target.cuda()\n",
    "        hidden = repackage_hidden(hidden)\n",
    "        model.zero_grad()\n",
    "        output, hidden = model(data, hidden)\n",
    "        loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))\n",
    "        loss.backward()\n",
    "        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)\n",
    "        optimizer.step()\n",
    "        if i % 1000 == 0:\n",
    "            print(\"epoch\", epoch, \"iter\", i, \"loss\", loss.item())\n",
    "    \n",
    "        if i % 10000 == 0:\n",
    "            val_loss = evaluate(model, val_iter)\n",
    "            \n",
    "            if len(val_losses) == 0 or val_loss < min(val_losses):\n",
    "                print(\"best model, val loss: \", val_loss)\n",
    "                torch.save(model.state_dict(), \"lm-best.th\")\n",
    "            else:\n",
    "                scheduler.step()\n",
    "                optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)\n",
    "            val_losses.append(val_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "best_model = RNNModel(\"LSTM\", VOCAB_SIZE, EMBEDDING_SIZE, EMBEDDING_SIZE, 2, dropout=0.5)\n",
    "if USE_CUDA:\n",
    "    best_model = best_model.cuda()\n",
    "best_model.load_state_dict(torch.load(\"lm-best.th\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用最好的模型在valid数据上计算perplexity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity:  140.72803934425724\n"
     ]
    }
   ],
   "source": [
    "val_loss = evaluate(best_model, val_iter)\n",
    "print(\"perplexity: \", np.exp(val_loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用最好的模型在测试数据上计算perplexity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity:  178.54742013696125\n"
     ]
    }
   ],
   "source": [
    "test_loss = evaluate(best_model, test_iter)\n",
    "print(\"perplexity: \", np.exp(test_loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用训练好的模型生成一些句子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "s influence clinton decision de gaulle is himself sappho s iv one family banquet was made published by paul <unk> and by a persuaded to prevent arcane of animate poverty based at copernicus bachelor in search services and in a cruise corps references eds the robin series july four one nine zero eight summer gutenberg one nine six four births one nine two eight deaths timeline of this method by the fourth amendment the german ioc known for his <unk> from <unk> one eight nine eight one seven eight nine management was established in one nine seven zero they had\n"
     ]
    }
   ],
   "source": [
    "hidden = best_model.init_hidden(1)\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)\n",
    "words = []\n",
    "for i in range(100):\n",
    "    output, hidden = best_model(input, hidden)\n",
    "    word_weights = output.squeeze().exp().cpu()\n",
    "    word_idx = torch.multinomial(word_weights, 1)[0]\n",
    "    input.fill_(word_idx)\n",
    "    word = TEXT.vocab.itos[word_idx]\n",
    "    words.append(word)\n",
    "print(\" \".join(words))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
